ABSTRACT

Questions often arise spontaneously in a curious mind, due to an observation about a new or unknown environment. 
When an expert is right there, prepared to engage in dialog, this curiosity can be harnessed and converted into highly 
effective, intrinsically motivated learning. This paper investigates how this kind of situated informal learning can be 
realized in real-world settings with wearable technologies and the support of a remote learning companion. In particular, 
we seek to understand how the use of different multimedia communication mediums impacts the quality of the interaction 
with a remote teacher, and how these remote interactions compare with face-to-face, co-present learning. A prototype 
system called TagAlong was developed with attention to features that facilitate dialog based on the visual environment. It 
was developed to work robustly in the wild, depending only on widely-available components and infrastructure. A pilot 
study was performed to learn about what characteristics are most important for successful interactions, as a basis for 
further system development and a future full-scale study. We conclude that it is critical for system design to be informed 
by (i) an analysis of the attentional burdens imposed by the system on both wearer and companion and (ii) a knowledge 
of the strengths and weaknesses of co-present learning. 

1. INTRODUCTION

Much of the practical and tacit everyday knowledge employed in workplaces is acquired on the job, through 
learning-by-doing. One reason why this type of learning is highly effective is that knowledgeable coworkers 
can be found nearby to not only assist in completing tasks at hand, but also to subsequently engage in broader 
discourse, in which ideas from specific tasks are generalized and abstracted. That is, they use these task 
examples as props to explain ideas that generalize to other tasks, explain the reasons behind procedures, and 
so on. The physical elements at hand provide a point of reference for the learner to become curious and ask 
follow-up questions, based not only on what the expert has chosen to highlight, but also on her independent 
observations of the environment and apparatus. With a shared physical environment as a precondition, this 
type of learning episode requires three attributes: the immediate need for assistance as impetus to start the 
conversation, an expert available to assist, and the opportunity to seamlessly transition from an 
assistance-focused dialog to a broader discussion through which deep knowledge exchange can happen. 

We seek to broaden the applicability of this powerful episodic learning model in two ways. Firstly, we 
seek to make it possible to learn in this way from an expert who is remote instead of co-present. This would 
dramatically increase the reach of such interactions, so they could happen at any time in any physical or 
geographic location. Secondly, we aim to apply this model not just to assistance and learning in workplaces 
or communities of practice, but also to the myriad other contexts where curiosity may arise, not only out of 
necessity. In general, this means engaging in dialog to answer and expound on questions that arise due to the 
immediate physical surroundings. By supporting informal and exploratory learning in this way, we can move 
towards a world where deep (human) learning can happen anywhere and everywhere, driven by the intrinsic 
motivation of curiosity. 

In this paper, we investigate three questions. What kind of system is necessary to facilitate a fluid
dialog between learner and remote expert, allowing each to make reference to specific physical objects in the
learner’s environment? Can such a system be made effective without being cumbersome or requiring
distracting device interactions? Can this be achieved today, building only on readily available devices and 
infrastructure? This represents an exploratory phase of research, the goal of which is to inform the design of 
future systems, as well as identify specific questions for future research, including full-scale studies that 
employ such systems. 

We designed a prototype system called TagAlong and performed a qualitative pilot study in which the 
same task (discussing fine artwork) was performed with three different communication mediums: (i) 
co-present (oral and gestural) communication, (ii) mobile phone-based video chat, and (iii) using the 
TagAlong prototype system, utilizing synchronous audio in combination with still image capture and 
annotation. 

This paper is structured as follows: Section 2 discusses prior work and positions TagAlong within this 
framework. This allows us to build on what is already known, as well as highlight the challenges that are 
particular to our use case. Then, Section 3 presents the design goals for the TagAlong prototype system, 
initial design, and enhancements based on the results of preliminary testing. Section 4 describes a pilot study 
based on a learning scenario in an art museum. Section 5 discusses implications for future work, and
finally, Section 6 presents concluding remarks.


2. RELATED WORK 

This work is situated at the intersection of learning, communication technology and wearable technology. 
From a learning perspective, the theoretical underpinnings are based on the concepts of informal learning, 
situated learning and contextual memory. Informal learning happens through curiosity or necessity within a 
social or experiential context, and is unintentional from the perspective of the learner. Situated learning refers 
to the acquisition of knowledge relevant to needs or actions at hand. In the field of language learning it has 
been shown that students are more receptive to learning relevant vocabulary and phrases in such 
circumstances (Brown 1987). Closely related is the theory of contextual memory, which holds that a person 
is more likely to recall information when situated in a context analogous to the one where they were 
originally exposed to it. This is explained by the presence of similar memory cues both at the time of 
exposure and the time of retrieval (Tulving and Thomson 1973; Davies and Thomson 1988). 

On the communication technology side some research has focused on use cases for collaboration with 
shared subject matter, evaluating the usefulness of different communication capabilities and mediums. For 
instance, Chastine et al. (Chastine et al. 2007) have looked at different configurations of physical and virtual
object representations in a collaborative 3D task, and investigated the impact of these representations in a 
virtual environment. They found that a fundamental requirement is the ability for users to effectively refer to 
artifacts within the shared environment. Ochsman and Chapanis (Ochsman and Chapanis 1974) have investigated the
comparative value of audio, video, and text channels to support cooperative problem solving, concluding that 
the audio channel is the most critical medium. 

Other research has focused on audio/video conferencing use cases where the communication medium 
primarily carries voice and video of the participants themselves. Isaacs and Tang (Isaacs and Tang 1994)
evaluated video conferencing as compared to audio calling and concluded that there is significant social 
value to seeing those you interact with, especially when the purpose is professional team-building. It 
facilitates interpreting non-verbal information, noticing peripheral cues and expressing attitudes. 

Even though video offers a rich communication medium and provides several advantages for interpreting 
non-verbal information between collaborators when using desktop or personal computers, in the case of task 
assistance or situated learning via a wearable device, this might not be the case. For example, video of the
face of the user using a wearable device would be both difficult to capture and not add much value when the 
purpose is to communicate about the environment. Careful attention must be given to what behaviors are 
supported through the choice of medium and other affordances of the system. 

Next we consider related work in wearable device interaction including assistance systems and 
telepresence. The development of wearable computing was just burgeoning when a seminal work carried out 
by Starner et al. (Starner et al. 1997) put forth the concept of computing that proactively assists a user. Early 
work such as that carried out by Feiner et al. (Feiner et al. 1997) looked specifically at how information could
be presented to users with wearable computers, highlighting the fact that the types of interaction required by
such systems are completely different from the ones required by desktops or mainframe computers.

Thereafter, researchers began to investigate how wearable systems could be used to facilitate remote 
collaboration since, by their nature, they were able to track and convey information about the wearer’s 
surrounding environment. For example, early work carried out by Mann (Mann 2000) on TelePointer shows that a mobile
system that gives very simple feedback in the wearer’s environment (a laser dot) can effectively be used to 
experience visual collaborative telepresence. Among the disadvantages of this system are that (i) it is limited 
to very simple feedback about the environment, (ii) this feedback is highly ephemeral, since any movement 
by either wearer or companion results in moving the laser dot off target, and (iii) it is quite obtrusive, if not 
dangerous, to those individuals in the immediate surroundings of the system due to its utilization of a laser
pointer. Building on the TelePointer concept, Gurevich et al. (Gurevich et al. 2012) developed a system for 
more sophisticated feedback called TeleAdvisor. TeleAdvisor is a stationary system that gives visual 
feedback using a projector mounted on a robotic arm. It enables a remote helper to view and interact with the 
worker’s workspace, while controlling the point of view.


3. TAGALONG SYSTEM 

The TagAlong system is a mobile context-sharing system that runs on Google Glass and a mobile phone or 
tablet. The system wearer can send still images on-demand to a remote companion or teacher, who can then 
reply by annotating the source material and sending it back. Synchronous interaction is optionally supported 
through the use of a real-time audio channel. In this section we describe the design goals, interaction design, 
and architectural decisions that make the system work robustly in the wild. 

3.1 Design Goals 

As outlined above, the TagAlong prototype system is intended to be both well-adapted to facilitating dialog 
about the wearer’s physical surroundings and usable in everyday settings. The latter requirement goes well 
beyond what is necessary just to perform our experiment. Designing and building the system to be usable in 
everyday settings makes our experimental results significant when considering (i) what kinds of systems can 
be built into an everyday usage flow, without assuming mass adoption, and (ii) what device technology and 
infrastructure are already here today. This way we can comment on how this will change in the immediate 
term, as well as where it is most important to invest effort in both of these areas. Accordingly, our design 
goals are as follows: 

• Create a system that can be worn continuously and with minimal burden by the wearer, and can allow
both users (the wearer and the companion) to interact with the system in a mobile setting.

• On the wearer side, the system should operate in the background, requiring little of her attention. Giving
input to the system should incur little startup cost, and when information from the companion
arrives, it should be noticeable but not disruptive.

• Support synchronous interaction, so as to reap maximum benefit from high-engagement interactions
with the companion.

• Additionally, support asynchronous interactions. There is a spectrum from low-engagement,
low-bandwidth communication to high-engagement, high-bandwidth communication that needs to be
supported for in-the-wild usage.

• Allow anyone with a smartphone to play the role of the companion, and accordingly develop a flow
for initiating interactions that is natural and requires a minimal amount of effort. This dramatically
increases the reach of real-world usage of the prototype system.

3.2 Interaction Design


There are two user roles in the TagAlong system: (1) the wearer, who uses a Google Glass connected via
Bluetooth to an Android mobile device, and (2) the companion, who is any person with a smartphone web
browser that the wearer connects with, as shown in Figure 1.

When the wearer wishes to interact with a remote companion, he uses the app to send a text message to 
the desired companion to ask her to “connect”, or “opt-in” to a TagAlong session. As such, the session 
through which the exchange of visual information happens is asynchronous and can last indefinitely (hours, 
days, or weeks). The text message contains a link which, when clicked, opens a page that describes what the 
companion is opting in to, and then allows her to confirm. Figure 2 shows the registration and connection 
process for the TagAlong users. For interactions with synchronous audio, the wearer can call the companion 
using the carrier network. 

When the wearer wishes to send an image, she presses a button once to enter a
view-finder mode that opens a camera preview on Google Glass. She then
presses a button to take the picture, and the companion receives a text message containing a URL.
Clicking on the URL opens the image in a browser window and allows the companion to annotate
and edit the image. The companion can then send back the annotated image, which is displayed on the
wearer’s device and accompanied by an audio chime.
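
The opt-in step and the image round trip described above can be summarized as a small state model. The following is only an illustrative sketch: the class names, the representation of annotations as text strings, and the rule that images cannot be sent before the companion confirms are assumptions made for the example, not details of the TagAlong implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class SessionState(Enum):
    INVITED = auto()    # wearer has sent the invitation text message
    CONNECTED = auto()  # companion followed the link and confirmed


@dataclass
class ImageMessage:
    image_id: int
    # Hypothetical representation: each annotation is a short text note.
    annotations: List[str] = field(default_factory=list)


@dataclass
class Session:
    wearer: str
    companion: str
    state: SessionState = SessionState.INVITED
    images: List[ImageMessage] = field(default_factory=list)

    def confirm_opt_in(self) -> None:
        """Companion confirms on the page reached from the invitation link."""
        self.state = SessionState.CONNECTED

    def send_image(self) -> ImageMessage:
        """Wearer captures a still image and shares it with the companion."""
        if self.state is not SessionState.CONNECTED:
            raise RuntimeError("companion has not opted in yet")
        message = ImageMessage(image_id=len(self.images) + 1)
        self.images.append(message)
        return message

    def annotate(self, image_id: int, note: str) -> ImageMessage:
        """Companion annotates an image; the result is returned to the wearer."""
        message = self.images[image_id - 1]
        message.annotations.append(note)
        return message
```

Because the session object outlives any single exchange, a model like this accommodates both quick synchronous exchanges and sessions that stay open for days.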



Figure 1. Basic Information Flow in TagAlong System 


Figure 2. Interaction Initiation Flow 


3.3 TagAlong Wearer UI Enhancements 

Exploratory usage of the initial implementation of TagAlong exposed a number of usability problems in the 
wearer interface. We describe these problems and the features we introduced to address them. 

3.3.1 Four-button Input 

We found the native Glass touchpad to be both unreliable at detecting input events and awkward to
operate in public settings (this awkwardness was heightened by the unreliability of event detection, when 
repeated attempts needed to be made). For this reason we used a wireless slide changer remote as an input 
device. This device is unobtrusive, can be easily stowed in a pocket or purse, and offers tactile feedback to 
support eyes-free operation. 

3.3.2 Status Notifications 

In early trials of our system, a lack of feedback for the wearer left them uncertain of the state of the system. 
We had used audio chimes to notify the user of progress - such as an image being captured or successfully
uploaded to the server. In moderately noisy environments or with moderate attentional loads, it was easy to
miss these updates. Our enhanced design uses visual status indicators to indicate whether a message has
been sent and seen by the receiver. In exploratory trials, users reported better usability when status messages 
were used. 
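
One way to think about such status indicators is as a message status that only ever moves forward. The sketch below is an assumption-laden illustration: the four stages are taken from the events mentioned above, but their ordering, the label strings, and the function names are hypothetical and do not describe the actual TagAlong code.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    CAPTURED = 1  # image taken on the wearer's device
    UPLOADED = 2  # image stored on the server
    SENT = 3      # companion has been notified
    SEEN = 4      # companion has opened the image


# Hypothetical label text for the wearer's display.
STATUS_LABELS = {
    MessageStatus.CAPTURED: "Captured",
    MessageStatus.UPLOADED: "Uploaded",
    MessageStatus.SENT: "Sent to companion",
    MessageStatus.SEEN: "Seen by companion",
}


def advance(current: MessageStatus, event: MessageStatus) -> MessageStatus:
    """Move forward only, so late or out-of-order events never regress the display."""
    return max(current, event)


def indicator(status: MessageStatus) -> str:
    """Return the text to show for the given status."""
    return STATUS_LABELS[status]
```

A persistent visual label of this kind remains readable after the fact, unlike a chime that may be missed in a noisy gallery.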

3.3.3 Viewfinder

Another important user interface choice was to use a viewfinder. The canonical use of the Google Glass 
camera takes a picture without showing a live camera preview. In our exploratory trials, users complained 
that they could not frame the picture sent to the remote companion, and this limited the effectiveness of these 
pictures at communicating the object of their attention. Introducing a viewfinder allows the user to know 
immediately both that the device is attempting to take a picture, and that the field of view is scoped and 
aligned as desired. 


4. PILOT STUDY: LEARNING FROM AN EXPERT 

The goal of this study was to understand how the fully-mobile TagAlong system compares with video 
streaming and face-to-face communication in terms of effectiveness at communicating about the visual 
environment for learning purposes. The basic activity in the study is an informal learning dialog about a work 
of art between an art expert and a novice. 

4.1 Setup 

A remote art expert interacts informally with a non-expert to convey knowledge about a specific work of art,
which the non-expert is in the presence of. No guidance is given to either party about how to initiate the
dialog. They interact using one of three conditions: (i) TagAlong with live audio, (ii) video streaming on a
mobile phone with live audio, and (iii) co-present, face-to-face interaction. In the dialog that they have, either 
party can determine the subject, questions can be asked, and clarifications requested. 



Figure 3. Wearer in Pilot Study

Figure 4. Companion in Pilot Study


In the TagAlong system condition, participants are connected with a live audio stream, and can use the 
system to exchange images and annotations of the subject matter in the wearer’s environment as shown in 
Figure 3 and Figure 4. The wearer’s interaction is hands-optional, since she only needs to use her hands when 
she wants to send an image. No physical posture change is required for her to shift focus between the 
system’s visual feedback and the world, and the input device can be operated eyes-free. 

In the second condition, the “wearer” streams video from a hand-held mobile phone. The wearer and 
companion are once again connected with a live audio stream. The video stream of the wearer’s rear-facing 
phone camera is previewed on-screen and streamed real-time to the companion’s mobile phone. There is no 
affordance for spatial annotation in the interface, but the wearer can physically point with his free hand to 
indicate an object of interest. 

In the third condition, learner and expert are co-present and communicate face-to-face. The two stand 
next to each other in front of the painting and discuss. 

4.2 Procedure

We tested a total of four wearers, each with one of two art expert companions. The wearers were each 
involved in two trials, and in each trial, the wearer and companion would discuss two works of art for five 
minutes each. For each wearer, one of the two trials was TagAlong, and for the other trial, two participants 
tried video streaming, while the other two tried face-to-face explanation. For each expert, the four art pieces 
stayed the same across trials, but the order of the trial conditions was switched in subsequent trials with 
respect to the TagAlong trial. 
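
To make the switching of condition order concrete, the sketch below generates a schedule for one expert under one reading of the procedure: the artwork pairs keep their position while the condition order alternates between successive wearers. The wearer and expert identifiers, the piece labels, and which alternative condition each wearer received are placeholders, not data from the study.

```python
from typing import List, NamedTuple, Tuple


class Trial(NamedTuple):
    expert: str
    wearer: str
    trial_no: int
    condition: str
    artworks: Tuple[str, str]


def schedule_for_expert(expert: str, wearers: List[str],
                        alternatives: List[str]) -> List[Trial]:
    """Build trials for one expert; alternatives[i] is wearer i's non-TagAlong condition."""
    # Placeholder labels: the same four pieces, in the same order, for every wearer.
    pairs = [("piece 1", "piece 2"), ("piece 3", "piece 4")]
    trials = []
    for i, (wearer, alternative) in enumerate(zip(wearers, alternatives)):
        # Alternate which condition comes first from one wearer to the next.
        if i % 2 == 0:
            order = ["TagAlong", alternative]
        else:
            order = [alternative, "TagAlong"]
        for trial_no, (condition, artworks) in enumerate(zip(order, pairs), start=1):
            trials.append(Trial(expert, wearer, trial_no, condition, artworks))
    return trials
```

Under this reading, each expert explains every work once with TagAlong and once in another condition, which is what makes the per-artwork contrast in the follow-up interview possible.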

After both trials were completed, an in-depth interview lasting between 5 and 15 minutes was conducted 
with both wearer and companion present. In addition, after each expert’s second trial, a solo follow-up 
interview was performed, where she was asked to contrast the experience of explaining the same work of art 
in the two different conditions, considering each work of art individually. 

4.3 Results 

First we compare results from the TagAlong system with video streaming, and then with the co-present 
learning condition. 

4.3.1 Still Images vs. Streaming Video 

Still images and streaming video each presented their own advantages and disadvantages. Still images from 
TagAlong provided a clear advantage over streaming video as a vehicle for detailed and persistent 
annotations by the companion. The companion could circle, underline, outline, and label with text specific 
visual elements. In the video streaming condition, companions needed to use verbal cues and gestures to 
draw the wearer’s attention to particular subject matter in order to explicate it. Once a subject was identified, 
a verbal description was needed to make detailed comments. As a corollary, the ability to freeze a subject 
matter allowed the dialog to be more focused. With moving images, the companion felt she had to follow the 
dynamic whim of the wearer, whereas with still images, her highlighting specific details also caused the 
wearer to stay still long enough to focus on those details. 

A disadvantage of still images in our implementation was that the companion was limited to annotating 
only the most recent image taken by the wearer. When this was a close-up, she no longer had an effective 
way of suggesting the next subject of focus. One way of addressing this limitation would be to allow the 
companion to refer back to previous (less close-up) images to suggest the next focal point (e.g., as shown by
Greenwald et al. (Greenwald, Khan, and Maes 2015)).
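
A minimal sketch of this idea follows: the companion's side keeps every frame received so far instead of only the latest, and any of them can be annotated. The class and method names and the free-text frame labels are hypothetical, and this is neither the TagAlong implementation nor the approach of the cited work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Frame:
    frame_id: int
    label: str  # hypothetical free-text description, e.g. "full painting"
    annotations: List[str] = field(default_factory=list)


class FrameHistory:
    """Keeps all frames sent by the wearer so earlier ones stay addressable."""

    def __init__(self) -> None:
        self._frames: List[Frame] = []

    def add(self, label: str) -> Frame:
        """Store a newly received frame and return it."""
        frame = Frame(frame_id=len(self._frames) + 1, label=label)
        self._frames.append(frame)
        return frame

    def latest(self) -> Optional[Frame]:
        """Return the most recent frame, or None if nothing has been received."""
        return self._frames[-1] if self._frames else None

    def get(self, frame_id: int) -> Frame:
        """Look up any earlier frame by its identifier."""
        if not 1 <= frame_id <= len(self._frames):
            raise KeyError(frame_id)
        return self._frames[frame_id - 1]

    def annotate(self, frame_id: int, note: str) -> Frame:
        """Annotate any stored frame, not only the most recent one."""
        frame = self.get(frame_id)
        frame.annotations.append(note)
        return frame
```

With a history like this, the companion could mark the next point of interest on an earlier wide shot even after the wearer has moved in for a close-up.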

Video streaming was clearly advantageous over still image exchange in terms of responsiveness. One 
wearer commented about using video-streaming when discussing the sculpture: 

I could show her what I was talking about in real-time. It was a smoother process than taking
pictures all the time. 

That is, when the subject was a sculpture, the wearer needed to move around to find interesting 
viewpoints, and having a live stream was advantageous. 

Even so, there was agreement among participants that which system was better depended on subject 
matter. As one art expert participant expressed: 

For Matta [painting] I prefer cards [still images]. I could actually point out the things that I 
thought were interesting ... [for the sculpture] I liked video actually, I think that has a lot to do 
with the subject matter. 

That is, when the subject was a painting with different objects and details to be highlighted, the ability to 
circle and point at them was more important. 

We propose that the interfaces we experimented with are best understood as performing two separate but 
overlapping functions - first, giving the companion some level of situational awareness about the physical 
surroundings and focus of attention of the wearer, and second, creating a locus of attention that is shared and 
can be modified by one party or the other. A video feed is constantly being moved by the wearer and gives 
the companion no control over the shared locus of attention other than to intervene verbally to highlight
visual landmarks. In the case of TagAlong, the wearer selects the frame, and the companion can select a
subregion and annotate by sketching.

12th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2015)

4.3.2 Co-Present vs Remote Learning 

One assumption we made going into this study was that “being there”, i.e., co-present learning, is always the 
best. Perhaps the most unexpected insight we gained was that this is not necessarily the case. Indeed,
participants in the face-to-face condition noted the comparatively more natural and intuitive way that gestures
could be used for both pointing and expression. However, they highlighted that using mobile devices to 
interact remotely allowed them to focus more exclusively on the work of art. It seems that this is related to 
the social burden of face-to-face interaction, or the need to “entertain” as one participant described it. 

I felt a little more like I was watching TV, with the Glass on, drowning into this painting while 
she’s talking. But there [in person] it felt more like I was trying to entertain, hold a 
conversation, smile, get a laugh out of you [addressing companion]. 

Face-to-face interaction carries with it the burden of proxemics, the ensemble of body and facial gestures
and eye contact that must be constantly maintained during co-present social interaction. Eliminating that 
burden liberates the learner’s attention to focus only on the subject matter. Participants, including experts and 
wearers alike, consistently echoed the sentiment that they were highly focused during the TagAlong system 
interaction in comparison to the feeling of having many distractions while the face-to-face discussion was 
taking place. This seems to support the claim that the maximum amount of attention was available for the 
artwork itself in the hands-free, remote condition (TagAlong). 

These results show that co-presence is an important point of reference which we can use to understand 
and predict what will work well. We can frame future work in terms of imitating co-presence in a targeted 
way. For example, we do wish to emulate the ability of either party to draw attention to a point in the 
environment. We do not wish to emulate the attentional burden of face-to-face social interaction. The 
generalization is that co-presence provides a wonderful set of affordances for two people to communicate at
high bandwidth. What it does not do is allow us to selectively switch off some of those affordances in order
to achieve greater focus for specific tasks. In essence, future work in this area concerns identifying ways of 
learning that are better than co-present, face-to-face interactions within certain contexts or with certain 
specific purposes in mind. 


5. FUTURE WORK 

The above results point towards some specific improvements to TagAlong-like systems in the immediate 
term, as well as some challenging ones for the longer term. 

There is a straightforward concept for designing a system that imparts the powerful feeling of
synchronous visual presence that we saw with video streaming, but also affords the important ability to
annotate specific objects which we saw when using still images: a hybrid system that offers
both live streaming and annotation at the same time. A split screen, or swappable picture-in-picture interface,
could be used to maintain both real-time awareness and the ability for the companion to suggest or define a
locus of attention.

Although the wearer is able to “point” by framing a still image and speaking over it, the ability to engage 
more directly in a dialog of annotation with the companion is something that would be sure to add expressive 
power for the wearer and hence make dialogues richer. The challenge would be to maintain the same low 
level of attention required for operating the system. Some candidate input methods would be Live Trace 
(Colaço et al. 2013), which uses a depth sensor to allow the wearer to lasso environmental objects using a
gesture at arm’s length in front of the face; the Nod ring, which uses a ring-mounted IMU to create 2D or 3D
input signals from free hand movements; and the Thalmic Labs Myo, which interprets a small discrete set of
hand gestures, in addition to including an IMU that could be used similarly to the Nod. 

Our results also pointed to the need for the companion to maintain the broadest possible representation of 
the environment, so that he can highlight subjects that are not currently being attended to by the wearer. This 
may be done by compiling all the data explicitly sent by the wearer device, but could also in general include 
reference information that could be externally retrieved. For example, in the case of the art museum, the 
companion could be provided with a map of the museum, as well as high-resolution representations of all the
art within it. With additional information, the expert companion would not be limited to just the
comparatively low-quality images provided by the wearer. The latter would only be used to create situational
awareness for the companion about what the wearer is currently attending to. In another use case, like
navigating streets, map archives like Google Maps may be used to invoke points of reference that haven’t
yet been visited by the wearer.

The overarching challenge in all of this future work is to avoid confusion or attentional burden when 
making systems with these hybrid assemblages of content which is streaming or frozen, past or present,
overlaid or peripheral, from internal and external sources, and so on. 


6. CONCLUSION 

In the present work we have demonstrated that individual mediums for learning-focused discourse have 
advantages and disadvantages based on what the subject matter is, and who is taking part in the dialog. Still 
images are good for making detailed reference to static elements of the environment, as we saw with the 
example of paintings. Video streaming, on the other hand, appears better when physical movement is 
important to the exploratory activity - in our example, viewing sculptures from different angles. Considering 
co-present versus remote teaching, in some cases face-to-face behaviors like gesturing and making eye
contact are helpful, but in other cases they can distract from the subject matter. In summary, our results 
support the claim that, in order to maximize the effectiveness of remote interactions, a carefully assembled 
melange of mediums should be selected for particular use cases. 

Moving forward along this path, we envision a world where seeking input from a remote expert will be as 
easy as tapping an office colleague on the shoulder. Tomorrow’s TagAlong-like systems will utilize 
numerous technologies to create ever more vivid glimpses into remote environments and the state of
those who occupy them. High-resolution 3D capture and display will make the environments seem real.
Real-time computer vision applied to these data streams will make it possible for the companion to identify 
and annotate environmental objects in a way that is fast and persistent. Labels and annotations can adapt to 
changes in the environment. Input and output may take many non-visual forms. For instance, remote sports 
instruction might use EMG data to inform the companion how the wearer is moving, and muscle stimulation
to allow her to intervene with correct motions. In the present work we haven’t even scratched the surface of more
exotic forms of input and output, such as those just mentioned, and the challenges we encountered will be 
compounded when these are brought into the mix. On one hand this means fruitful grounds for future 
research, and on the other it calls for a principled approach, since we will otherwise be overwhelmed by the 
combinatorial complexity of the design space, and corresponding difficulty of finding good designs. 